Skip to content

fix(api): close private dataset doc auth bypass#14749

Open
Parvezkhan0 wants to merge 2 commits into
infiniflow:mainfrom
Parvezkhan0:fix/private-dataset-doc-auth-14659
Open

fix(api): close private dataset doc auth bypass#14749
Parvezkhan0 wants to merge 2 commits into
infiniflow:mainfrom
Parvezkhan0:fix/private-dataset-doc-auth-14659

Conversation

@Parvezkhan0
Copy link
Copy Markdown

@Parvezkhan0 Parvezkhan0 commented May 10, 2026

What problem does this PR solve?

Fixes #14659

This PR fixes an authorization bypass in the token-authenticated document SDK APIs for private datasets (permission = me).
The affected routes validated dataset access with KnowledgebaseService.accessible(), but when a login token was used through the SDK token flow, the check received the caller's tenant/workspace id instead of the authenticated user id. That let another member of the same tenant pass access checks for a private dataset if they knew the dataset_id, exposing document listings and allowing document operations that should have remained owner-only.

This change forwards the authenticated user id through the token-auth flow and uses it for dataset authorization in the affected SDK document endpoints, so private datasets remain accessible only to their owner while team datasets continue to work as before.

Type of change

  • Bug Fix (non-breaking change which fixes an issue)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 10, 2026 13:31
@dosubot dosubot Bot added the size:M This PR changes 30-99 lines, ignoring generated files. label May 10, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 10, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

Routes in api/apps/sdk/doc.py now accept optional authenticated_user_id and use a new _dataset_access_actor_id helper to derive actor_id (prefer authenticated_user_id). Route authorization calls to KnowledgebaseService.accessible(...) use that actor_id. token_required inspects handler signatures and injects authenticated_user_id when resolving login-token users. Tests updated/added to verify behavior.

Changes

User-Based Authorization for SDK Routes

Layer / File(s) Summary
Authorization identity and route signatures
api/apps/sdk/doc.py
Adds _dataset_access_actor_id(tenant_id, authenticated_user_id) and updates download, parse, stop_parsing, retrieval_test signatures to accept authenticated_user_id: str | None = None.
Route authorization calls
api/apps/sdk/doc.py
download, parse, stop_parsing, and retrieval_test compute actor via helper and pass it as user_id to KnowledgebaseService.accessible(...) instead of using tenant_id.
Authentication decorator
api/utils/api_utils.py
token_required inspects the wrapped handler signature for authenticated_user_id and conditionally injects it into kwargs when resolving a login-token JWT user.
Tests & Verification
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py, test/unit_test/api/utils/test_api_utils_token_required.py
Updated test_download_and_download_doc_errors to use KnowledgebaseService.accessible; added test_sdk_routes_use_authenticated_user_for_dataset_access to assert routes pass authenticated_user_id; added unit test validating token_required injects authenticated_user_id for login tokens.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

  • #14659: Bug report describing TDocument APIs leaking private datasets to tenant members — this PR supplies per-user actor wiring so accessible(...) can enforce owner-only access.

Possibly related PRs

  • infiniflow/ragflow#14645: Changes KnowledgebaseService.accessible() and DocumentService.accessible() to enforce per-dataset permission rules; complements this PR's actor wiring.
  • infiniflow/ragflow#14782: Related adjustments in api/apps/sdk/doc.py around retrieval_test authorization checks.
  • infiniflow/ragflow#14910: Modifies download-route authorization wiring related to tenant/user identity.

Suggested reviewers

  • wangq8
  • Lynn-Inf

Poem

🐰 I hop through code with nimble feet,
Passing user ids where checks must meet.
Tenant or user, the helper decides,
Now private datasets stay safe inside.
Hooray — secure access, carrot-feast!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: fixing an authorization bypass in private dataset document access for SDK APIs.
Description check ✅ Passed The PR description follows the template with both required sections completed: a clear problem statement with issue reference and type of change (Bug Fix) is provided.
Linked Issues check ✅ Passed The PR directly addresses issue #14659 by forwarding authenticated user IDs through token-auth flows and using them for dataset authorization checks, enforcing owner-only access for private datasets.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing the authorization bypass: helper function for actor ID derivation, decorator enhancements to pass authenticated user ID, route signature updates, and comprehensive tests.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@dosubot dosubot Bot added 🐖api The modified files are located under directory 'api/apps/sdk' 🐞 bug Something isn't working, pull request that fix bug. 🧪 test Pull requests that update test cases. labels May 10, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (1)
api/apps/sdk/doc.py (1)

53-55: ⚡ Quick win

Add a docstring to explain the access control logic.

The function handles important authorization semantics (JWT-authenticated user vs API-key tenant fallback) without documentation. The suggested docstring clarifies when each path is used:

📝 Suggested docstring
 def _dataset_access_actor_id(tenant_id: str, authenticated_user_id: str | None = None) -> str:
+    """
+    Determine the actor ID for dataset access authorization.
+    
+    Returns authenticated_user_id when available (JWT/login-token flow),
+    otherwise falls back to tenant_id (API-key flow or unauthenticated).
+    """
     return authenticated_user_id or tenant_id
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 53 - 55, Add a clear docstring to
_dataset_access_actor_id describing the access-control semantics: explain that
when a JWT-authenticated user ID is present the function returns that user (used
for per-user authorization/audit), and when authenticated_user_id is None it
falls back to the tenant_id (used for API-key based requests or tenant-scoped
actions); mention expected parameter types and the function's return value and
include a short example of both paths for clarity.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/apps/sdk/doc.py`:
- Around line 208-211: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks
logging; add logging similar to the download flow: log an info/debug message
before the check indicating the dataset_id and actor_id being validated and log
a warning/error when access is denied (including dataset_id and actor_id) so
operators can trace authorization failures; update the same function that
performs this check to call the logger and follow the existing logging format
used by the download function.
- Around line 302-305: The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing
logging; add the same logging pattern used in the download/parse flows to record
the check input (dataset_id, tenant_id, authenticated_user_id) and the
authorization outcome. Specifically, before/around the if, emit a log entry
(matching the existing logger level/format used in download/parse) that includes
dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether
access was granted/denied so the decision is auditable and consistent with other
flows.
- Around line 432-435: The dataset access check loop lacks logging; add
structured logs using the existing actor id and kb_ids: log a single INFO-level
message before the loop mentioning kb_ids and actor_id (from
_dataset_access_actor_id), and inside the loop log a WARN/ERROR when
KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False
including the denied kb id and actor_id (and authenticated_user_id if available)
before returning get_error_data_result so denied access events are recorded for
debugging and audit.
- Around line 97-100: Add structured logging around the
KnowledgebaseService.accessible check: log the attempted access with dataset_id,
the actor returned by _dataset_access_actor_id(tenant_id,
authenticated_user_id), and the authorization result (allowed/denied). Use the
module logger (e.g., logger or security logger) and emit a warning or info-level
entry when access is denied (include tenant_id and authenticated_user_id as
context), so the authorization decision for KnowledgebaseService.accessible is
recorded for audit and debugging.

In `@api/utils/api_utils.py`:
- Around line 339-340: The code path that injects authenticated_user_id into
kwargs (the block checking accepts_authenticated_user_id and setting
kwargs["authenticated_user_id"] = user[0].id) needs an audit log entry; add a
concise log statement (using the module logger or existing logger variable)
immediately before or after the assignment that records the action and key
context such as the injected user id and the target dataset/request identifier
(if available) and use an appropriate level (info/debug) while avoiding
sensitive data; update any import or logger initialization (e.g., logger =
logging.getLogger(__name__)) if not already present.

---

Nitpick comments:
In `@api/apps/sdk/doc.py`:
- Around line 53-55: Add a clear docstring to _dataset_access_actor_id
describing the access-control semantics: explain that when a JWT-authenticated
user ID is present the function returns that user (used for per-user
authorization/audit), and when authenticated_user_id is None it falls back to
the tenant_id (used for API-key based requests or tenant-scoped actions);
mention expected parameter types and the function's return value and include a
short example of both paths for clarity.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 272a4cf4-6bed-4bf2-a7f7-55ab6fde80c1

📥 Commits

Reviewing files that changed from the base of the PR and between 6bfe0f9 and 60e606a.

📒 Files selected for processing (4)
  • api/apps/sdk/doc.py
  • api/utils/api_utils.py
  • test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py
  • test/unit_test/api/utils/test_api_utils_token_required.py

Comment thread api/apps/sdk/doc.py
Comment on lines +97 to +100
if not KnowledgebaseService.accessible(
kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

This security-critical authorization check determines access to private datasets but lacks logging. Adding a log statement would improve security audit trails and debugging.

🔒 Proposed logging addition
+    actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+    logging.debug("Checking dataset access: dataset_id=%s actor_id=%s (authenticated_user_id=%s)", dataset_id, actor_id, authenticated_user_id)
     if not KnowledgebaseService.accessible(
         kb_id=dataset_id,
-        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
+        user_id=actor_id,
     ):
+        logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", dataset_id, actor_id)
         return get_error_data_result(message=f"You do not own the dataset {dataset_id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 97 - 100, Add structured logging around the
KnowledgebaseService.accessible check: log the attempted access with dataset_id,
the actor returned by _dataset_access_actor_id(tenant_id,
authenticated_user_id), and the authorization result (allowed/denied). Use the
module logger (e.g., logger or security logger) and emit a warning or info-level
entry when access is denied (include tenant_id and authenticated_user_id as
context), so the authorization decision for KnowledgebaseService.accessible is
recorded for audit and debugging.

Comment thread api/apps/sdk/doc.py
Comment on lines +208 to +211
if not KnowledgebaseService.accessible(
kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

Similar to the download function, this authorization check lacks logging. Consider adding the same logging pattern here for consistency.

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 208 - 211, The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) lacks
logging; add logging similar to the download flow: log an info/debug message
before the check indicating the dataset_id and actor_id being validated and log
a warning/error when access is denied (including dataset_id and actor_id) so
operators can trace authorization failures; update the same function that
performs this check to call the logger and follow the existing logging format
used by the download function.

Comment thread api/apps/sdk/doc.py
Comment on lines +302 to +305
if not KnowledgebaseService.accessible(
kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

Similar to the download and parse functions, this authorization check lacks logging. Consider adding the same logging pattern for consistency.

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 302 - 305, The authorization check using
KnowledgebaseService.accessible(kb_id=dataset_id,
user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id)) is missing
logging; add the same logging pattern used in the download/parse flows to record
the check input (dataset_id, tenant_id, authenticated_user_id) and the
authorization outcome. Specifically, before/around the if, emit a log entry
(matching the existing logger level/format used in download/parse) that includes
dataset_id, tenant_id, actor id from _dataset_access_actor_id(...), and whether
access was granted/denied so the decision is auditable and consistent with other
flows.

Comment thread api/apps/sdk/doc.py
Comment on lines +432 to 435
actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
for id in kb_ids:
if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id):
if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
return get_error_data_result(f"You don't own the dataset {id}.")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for dataset access authorization checks.

This function checks access to potentially multiple datasets but lacks logging. Since this loops over dataset IDs, consider logging once before the loop (with the full list) and/or within the loop for denied access.

🔍 Proposed logging addition
     actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+    logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id)
     for id in kb_ids:
         if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
+            logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id)
             return get_error_data_result(f"You don't own the dataset {id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
for id in kb_ids:
if not KnowledgebaseService.accessible(kb_id=id, user_id=tenant_id):
if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
return get_error_data_result(f"You don't own the dataset {id}.")
actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
logging.debug("Checking access to datasets: dataset_ids=%s actor_id=%s (authenticated_user_id=%s)", kb_ids, actor_id, authenticated_user_id)
for id in kb_ids:
if not KnowledgebaseService.accessible(kb_id=id, user_id=actor_id):
logging.warning("Dataset access denied: dataset_id=%s actor_id=%s", id, actor_id)
return get_error_data_result(f"You don't own the dataset {id}.")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 432 - 435, The dataset access check loop
lacks logging; add structured logs using the existing actor id and kb_ids: log a
single INFO-level message before the loop mentioning kb_ids and actor_id (from
_dataset_access_actor_id), and inside the loop log a WARN/ERROR when
KnowledgebaseService.accessible(kb_id=id, user_id=actor_id) returns False
including the denied kb id and actor_id (and authenticated_user_id if available)
before returning get_error_data_result so denied access events are recorded for
debugging and audit.

Comment thread api/utils/api_utils.py
Comment on lines +339 to +340
if accepts_authenticated_user_id:
kwargs["authenticated_user_id"] = user[0].id
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add logging for the authenticated user injection flow.

This security-relevant code path injects the authenticated user ID for dataset authorization checks but lacks logging. Adding a log statement would improve observability and audit trails for private dataset access.

📊 Proposed logging addition
                    kwargs["tenant_id"] = tenants[0].tenant_id
                    if accepts_authenticated_user_id:
+                       logging.debug("JWT authentication: injecting authenticated_user_id=%s for tenant_id=%s", user[0].id, tenants[0].tenant_id)
                        kwargs["authenticated_user_id"] = user[0].id

As per coding guidelines, "**/*.py: Add logging for new flows".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/utils/api_utils.py` around lines 339 - 340, The code path that injects
authenticated_user_id into kwargs (the block checking
accepts_authenticated_user_id and setting kwargs["authenticated_user_id"] =
user[0].id) needs an audit log entry; add a concise log statement (using the
module logger or existing logger variable) immediately before or after the
assignment that records the action and key context such as the injected user id
and the target dataset/request identifier (if available) and use an appropriate
level (info/debug) while avoiding sensitive data; update any import or logger
initialization (e.g., logger = logging.getLogger(__name__)) if not already
present.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an authorization bypass in SDK document endpoints when callers authenticate using a login token: the SDK auth layer now forwards the authenticated user id (in addition to tenant/workspace id), and the SDK document routes use that user id for dataset authorization so permission = me datasets remain owner-only.

Changes:

  • Update token_required to optionally inject authenticated_user_id for login-token authentication when the wrapped handler accepts it.
  • Update SDK document routes to authorize dataset access using the authenticated user id (falling back to tenant id when not available).
  • Add/adjust unit tests to verify correct user-id propagation and that SDK document routes pass the correct actor id into KnowledgebaseService.accessible().

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
api/utils/api_utils.py Enhances token_required to conditionally pass authenticated_user_id for login-token flows.
api/apps/sdk/doc.py Uses authenticated user id for dataset access checks in SDK document endpoints (download/parse/stop/retrieval).
test/unit_test/api/utils/test_api_utils_token_required.py New unit test validating token_required injects authenticated user id for login tokens.
test/testcases/test_http_api/test_file_management_within_dataset/test_doc_sdk_routes_unit.py Updates existing unit tests and adds coverage ensuring SDK routes call dataset access checks with user_id=authenticated_user_id.

@KevinHuSh KevinHuSh added the ci Continue Integration label May 19, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
api/apps/sdk/doc.py (1)

98-102: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add audit logging around the new authorization decision points.

The new actor-based access checks return early on denial but still don’t log the decision path. Please add a consistent pre-check and denied-check log for these paths.

🔧 Suggested pattern
+    actor_id = _dataset_access_actor_id(tenant_id, authenticated_user_id)
+    logging.debug(
+        "sdk.dataset_access.check dataset_id=%s actor_id=%s tenant_id=%s authenticated_user_id=%s",
+        dataset_id, actor_id, tenant_id, authenticated_user_id
+    )
-    if not KnowledgebaseService.accessible(
-        kb_id=dataset_id,
-        user_id=_dataset_access_actor_id(tenant_id, authenticated_user_id),
-    ):
+    if not KnowledgebaseService.accessible(kb_id=dataset_id, user_id=actor_id):
+        logging.warning(
+            "sdk.dataset_access.denied dataset_id=%s actor_id=%s tenant_id=%s",
+            dataset_id, actor_id, tenant_id
+        )
         return get_error_data_result(message=f"You do not own the dataset {dataset_id}.")

As per coding guidelines, "**/*.py: Add logging for new flows".

Also applies to: 220-223, 314-317, 444-447

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/apps/sdk/doc.py` around lines 98 - 102, Add audit logging around the
KnowledgebaseService.accessible authorization checks: before calling
KnowledgebaseService.accessible (in the block using _dataset_access_actor_id and
any similar checks at the other noted locations) log a pre-check audit entry
with tenant_id, authenticated_user_id, dataset_id and the actor id being used;
if accessible returns False, log a denied audit entry with the same context and
the decision reason, then return get_error_data_result as before. Use the same
logger used elsewhere in this module and include function names or request ids
where available to correlate logs (reference: KnowledgebaseService.accessible,
_dataset_access_actor_id, get_error_data_result).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@api/apps/sdk/doc.py`:
- Around line 98-102: Add audit logging around the
KnowledgebaseService.accessible authorization checks: before calling
KnowledgebaseService.accessible (in the block using _dataset_access_actor_id and
any similar checks at the other noted locations) log a pre-check audit entry
with tenant_id, authenticated_user_id, dataset_id and the actor id being used;
if accessible returns False, log a denied audit entry with the same context and
the decision reason, then return get_error_data_result as before. Use the same
logger used elsewhere in this module and include function names or request ids
where available to correlate logs (reference: KnowledgebaseService.accessible,
_dataset_access_actor_id, get_error_data_result).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1ea4335b-e40e-4c1a-975b-01a51b72f698

📥 Commits

Reviewing files that changed from the base of the PR and between 60e606a and 56d716d.

📒 Files selected for processing (2)
  • api/apps/sdk/doc.py
  • api/utils/api_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/utils/api_utils.py

@codecov
Copy link
Copy Markdown

codecov Bot commented May 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 94.16%. Comparing base (db9e782) to head (56d716d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #14749   +/-   ##
=======================================
  Coverage   94.16%   94.16%           
=======================================
  Files          10       10           
  Lines         703      703           
  Branches      112      112           
=======================================
  Hits          662      662           
  Misses         25       25           
  Partials       16       16           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@KevinHuSh
Copy link
Copy Markdown
Collaborator

KevinHuSh commented May 19, 2026

Appreciations!

Does this mean that all the functions using @token_required needs to delcare one more parameter authenticated_user_id?

dripsmvcp added a commit to dripsmvcp/ragflow that referenced this pull request May 20, 2026
`POST /api/v1/dify/retrieval` resolved the caller via @apikey_required
(injecting `tenant_id`) but then fetched the requested `knowledge_id`
with no tenant filter and ran the full retrieval pipeline against
`kb.tenant_id` (the owner). Any valid Dify-compatible API key could
retrieve chunks from any tenant whose KB UUID was known.

Mirror the pattern PR infiniflow#14749 adds to the sibling sdk/doc.py routes:
after the existing `KnowledgebaseService.get_by_id` lookup, call
`KnowledgebaseService.accessible(kb_id, tenant_id)` and return
`RetCode.AUTHENTICATION_ERROR` ("No authorization.") when it returns
False. No behavior change for owners or for team members already
allowed by the existing accessible() rules.

Closes infiniflow#15027
dripsmvcp added a commit to dripsmvcp/ragflow that referenced this pull request May 21, 2026
`POST /api/v1/dify/retrieval` resolved the caller via @apikey_required
(injecting `tenant_id`) but then fetched the requested `knowledge_id`
with no tenant filter and ran the full retrieval pipeline against
`kb.tenant_id` (the owner). Any valid Dify-compatible API key could
retrieve chunks from any tenant whose KB UUID was known.

Mirror the pattern PR infiniflow#14749 adds to the sibling sdk/doc.py routes:
after the existing `KnowledgebaseService.get_by_id` lookup, call
`KnowledgebaseService.accessible(kb_id, tenant_id)` and return
`RetCode.AUTHENTICATION_ERROR` ("No authorization.") when it returns
False. No behavior change for owners or for team members already
allowed by the existing accessible() rules.

Closes infiniflow#15027
wangq8 pushed a commit that referenced this pull request May 21, 2026
POST /api/v1/dify/retrieval resolved the caller via @apikey_required
(injecting tenant_id) but then fetched the requested knowledge_id with
no tenant filter and ran the full retrieval pipeline against
kb.tenant_id (the owner). Any valid Dify-compatible API key could
retrieve chunks from any tenant whose KB UUID was known. Adds the
missing ownership check.

## Root Cause
api/apps/sdk/dify_retrieval.py line 253:
KnowledgebaseService.get_by_id(kb_id) fetched the KB by id alone, then
the handler used kb.tenant_id (the OWNER) to build the embedding model
and call the retriever. The caller tenant_id was only used downstream at
line 278 for retrieval_by_children, well after cross-tenant data was
already retrieved.

grep confirmed there was no KnowledgebaseService.accessible call
anywhere in the handler.

## Fix
Two-line guard immediately after the existing get_by_id lookup,
mirroring the pattern PR #14749 lands for the sibling sdk/doc.py routes
(download, parse, stop_parsing, retrieval_test):

    e, kb = KnowledgebaseService.get_by_id(kb_id)
    if not e:
return build_error_result(message="Knowledgebase not found!",
code=RetCode.NOT_FOUND)
+   if not KnowledgebaseService.accessible(kb_id, tenant_id):
+ return build_error_result(message="No authorization.",
code=RetCode.AUTHENTICATION_ERROR)
    if kb.tenant_embd_id:
        ...

KnowledgebaseService.accessible already handles solo-tenant ownership,
team membership via TenantService.get_joined_tenants_by_user_id, and the
permission=ME distinction. No behavior change for legitimate callers;
cross-tenant callers now receive RetCode.AUTHENTICATION_ERROR (109).

## Test Plan
- [x] Regression test added:
test/unit_test/api/apps/sdk/test_dify_retrieval.py
- test_cross_tenant_request_is_rejected -- attacker tenant calling owner
tenant KB gets 109; retriever is not invoked
- test_same_tenant_request_succeeds -- owner tenant gets the records
back
- test_missing_knowledge_base_returns_not_found -- missing KB returns
404 BEFORE the access check fires (legit callers see the clearer
message)
- [x] All 3 tests pass after the fix
- [x] Cross-tenant test FAILS on pre-fix main (KeyError on result[code]
because handler leaks records dict instead of returning auth error)
- [x] ruff check clean on both changed files
- [x] No drive-by reformatting in dify_retrieval.py -- only the 2 added
lines

### Post-fix output

    test_cross_tenant_request_is_rejected           PASSED [ 33%]
    test_same_tenant_request_succeeds               PASSED [ 66%]
    test_missing_knowledge_base_returns_not_found   PASSED [100%]

============================== 3 passed in 0.04s
===============================

Closes #15027
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

🐖api The modified files are located under directory 'api/apps/sdk' 🐞 bug Something isn't working, pull request that fix bug. ci Continue Integration size:M This PR changes 30-99 lines, ignoring generated files. 🧪 test Pull requests that update test cases.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: TDocument APIs leak private datasets to tenant members

3 participants